Scalable, Parameter- and Memory-Efficient Pretraining for Large Language Models
Recent Algorithmic Advances and Comprehensive Benchmarking
Authors: A. Glentis et al.
Published on arXiv: 2025-05-28
Link: http://arxiv.org/abs/2505.22922v1
Institutions: University of Minnesota • Peking University • University of Sydney
Keywords: large language models, parameter-efficient pre-training, memory-efficient optimization, low-rank factorization, weight refactorization, momentum reset, GaLore, Fira, SLTrain, LLaMA, LoRA, C4 dataset, scaling laws, AdamW, model compression, benchmarking
The exponential growth in the scale of large language models (LLMs), now reaching trillions of parameters, poses significant computational and memory challenges, especially during the pre-training and fine-tuning phases. Parameter-efficient fine-tuning techniques such as LoRA have succeeded on downstream tasks, but applying such efficiency methods directly to LLM pre-training remains difficult due to the scale and data requirements involved.
To address these issues, the authors conducted an in-depth examination of current strategies and proposed new practical improvements:
- Comprehensive review of state-of-the-art parameter- and memory-efficient pre-training methods, focusing on those evaluated for LLM pre-training.
- Benchmarking leading memory-efficient pre-training approaches, including memory-efficient optimizers (GaLore, Fira) and weight factorization (Low-rank, LoRA, SLTrain) across various LLaMA model sizes (60M to 1B parameters) using the C4 dataset.
- Rigorous comparison of optimization techniques using hyperparameter sweeps, along with best practices such as momentum reset and adaptive gradient clipping, to ensure fair baselines.
- Introduction of two practical innovations: weight refactorization (periodic SVD updates of factorized weights), and momentum reset (periodically zeroing optimizer momentum in AdamW), to boost the efficiency and performance of low-rank/SLTrain methods.
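The two proposed techniques can be sketched in a few lines. The following is a minimal NumPy illustration, assuming a layer factorized as W ≈ B @ A and an AdamW-style optimizer state keyed by parameter with `exp_avg`/`exp_avg_sq` moment buffers (the names and layout are illustrative assumptions, not the authors' code):

```python
import numpy as np

def refactorize(B, A, rank):
    """Weight refactorization: periodically recompute the low-rank
    factors from the current product W = B @ A via a truncated SVD,
    rebalancing the singular values across both factors."""
    W = B @ A
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B_new = U[:, :rank] * np.sqrt(s[:rank])          # (m, rank)
    A_new = np.sqrt(s[:rank])[:, None] * Vt[:rank]   # (rank, n)
    return B_new, A_new

def reset_momentum(adam_state):
    """Momentum reset: periodically zero AdamW's first and second
    moment estimates (and step count) so that stale statistics do
    not fight the freshly refactorized weights."""
    for p_state in adam_state.values():
        p_state["exp_avg"][:] = 0.0
        p_state["exp_avg_sq"][:] = 0.0
        p_state["step"] = 0
```

In a training loop, both operations would be invoked every fixed number of steps; since the SVD of B @ A has at most rank min(m, n, r) nonzero singular values, refactorizing at the same rank preserves the represented weight matrix exactly while reconditioning the factors.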
Building on these approaches, their benchmarking uncovered several notable findings:
- Full-rank models remain the highest-performing when optimally trained.
- Plain low-rank factorization yields surprisingly competitive perplexity for small models; performance degrades for larger models but, contrary to prior belief, never fails entirely.
- Restoring full-rankness in factorization- or optimizer-based methods (SLTrain, Fira) substantially reduces the performance gap compared to full-rank models.
- The newly introduced weight refactorization and momentum reset techniques further improve low-rank/SLTrain models, approaching the performance of leading memory-efficient optimizers (e.g., for LLaMA 1B, SLTrain-restarts achieved 14.37 perplexity vs. 13.97 for full-rank while saving ~25% memory).
- Scaling law analysis reveals that final perplexity depends primarily on total computation (FLOPs) rather than on the specific model configuration.
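To make the memory savings behind these findings concrete, here is a small parameter-count calculation for a single factorized layer. The layer size and rank below are illustrative assumptions, not configurations reported in the paper:

```python
def full_params(d_in, d_out):
    """Parameter count of a dense d_out x d_in weight matrix."""
    return d_out * d_in

def factorized_params(d_in, d_out, rank):
    """Parameter count of a rank-r factorization W ~= B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return d_out * rank + rank * d_in

# Illustrative example: a 4096x4096 projection at rank 256 stores
# only 1/8 of the dense layer's parameters.
full = full_params(4096, 4096)
low = factorized_params(4096, 4096, 256)
print(f"low-rank fraction: {low / full:.3f}")  # low-rank fraction: 0.125
```

Optimizer memory shrinks proportionally, since AdamW keeps two moment buffers per trainable parameter; hybrid schemes like SLTrain add a sparse component on top of the factors, which is why their end-to-end savings (~25% for LLaMA 1B above) are smaller than this per-layer ratio.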
These results lead to several important conclusions and directions for future work:
- Well-designed and well-tuned efficient pre-training methods can reach performance close to full-model training, though a minor gap persists for the largest models.
- Practical techniques like weight refactorization and momentum reset play a critical role in narrowing this gap.
- Future work will extend the benchmark to more models and datasets and develop additional efficient pre-training techniques.